14 research outputs found

    Guarded Policy Optimization with Imperfect Online Demonstrations

    Full text link
    The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When assumed to be optimal, the teacher policy has the perfect timing and capability to intervene in the student's learning process, providing safety guarantees and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis shows that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C
    Comment: Accepted at ICLR 2023 (top 25%)
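
    As a rough illustration of how value-gated intervention could look, the sketch below switches control to the teacher only when its estimated value exceeds the student's by a margin. The names `student_policy`, `teacher_policy`, `value_fn`, and the threshold `epsilon` are hypothetical stand-ins; this is not the paper's actual TS2C implementation.

```python
def shared_control_step(env, obs, student_policy, teacher_policy, value_fn, epsilon=0.1):
    """One environment step under value-gated teacher intervention (illustrative only).

    value_fn(obs, action) is assumed to estimate the return of taking `action`
    in `obs` and following the teacher afterwards. The teacher takes over only
    when the student's proposal looks worse by more than `epsilon`.
    """
    a_student = student_policy(obs)
    a_teacher = teacher_policy(obs)

    # Gate the intervention on estimated values rather than on action mismatch,
    # so a mediocre teacher does not override student actions that are already
    # good enough.
    intervened = value_fn(obs, a_teacher) - value_fn(obs, a_student) > epsilon
    action = a_teacher if intervened else a_student

    next_obs, reward, done, info = env.step(action)  # gym-style env assumed
    return next_obs, reward, done, intervened
```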

    MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning

    Full text link
    Driving safely requires multiple capabilities from human and intelligent agents, such as generalizability to unseen environments, safety awareness of the surrounding traffic, and decision-making in complex multi-agent settings. Despite the great success of Reinforcement Learning (RL), most RL research investigates each capability separately due to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support research on generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional and can generate an infinite number of diverse driving scenarios from both procedural generation and real data import. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning in multi-agent traffic. Generalization experiments conducted on both procedurally generated and real-world scenarios show that increasing the diversity and size of the training set improves the generalizability of RL agents. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and demo video are available at https://metadriverse.github.io/metadrive . More research projects based on the MetaDrive simulator are listed at https://metadriverse.github.io
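
    For readers unfamiliar with the workflow the abstract describes, a minimal gym-style rollout over procedurally generated scenarios might look like the sketch below. It assumes the `metadrive` package is installed and follows the gymnasium-style reset/step convention of recent releases; the config keys shown are illustrative and vary across versions, so check the project documentation.

```python
# Minimal sketch of a MetaDrive rollout, assuming the gymnasium-style API of
# recent releases (reset() -> (obs, info), step() -> 5-tuple). Config keys are
# placeholders; consult the MetaDrive docs for the names your version supports.
from metadrive import MetaDriveEnv

env = MetaDriveEnv(config={
    "num_scenarios": 100,    # how many procedurally generated maps to sample from
    "start_seed": 0,         # seed offset selecting which scenarios are generated
    "traffic_density": 0.1,  # density of surrounding traffic vehicles
})

obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy standing in for an RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```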

    PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement

    Full text link
    Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely regarded as a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL heavily relies on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate this problem, we propose a novel paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from preferences over users' historical behaviors rather than from explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression, and reward model pre-training to improve performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods on all the tasks.
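
    The abstract states that PrefRec trains a reward function end-to-end from preferences over users' historical behaviors. A common recipe for this in preference-based RL is a Bradley-Terry loss over pairs of trajectory segments; the sketch below shows that generic recipe in PyTorch and is not PrefRec's actual architecture or objective.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefer_a):
    """Bradley-Terry loss: the preferred segment should get a higher summed reward.

    seg_a, seg_b: tuples (obs, act), each of shape [batch, T, dim]
    prefer_a:     float tensor in {0, 1}, 1 if segment A was preferred
    """
    r_a = reward_model(*seg_a).sum(dim=-1)   # return estimate for segment A
    r_b = reward_model(*seg_b).sum(dim=-1)   # return estimate for segment B
    logits = r_a - r_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)
```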

    State Regularized Policy Optimization on Data with Dynamics Shift

    Full text link
    In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient, as data are used ad hoc and policies trained for one set of dynamics cannot benefit from data collected in other environments with different dynamics. In this paper, we find that in many environments with similar structures but different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. For the theoretical analysis, the intuition of similar environment structures is formalized by the notion of homomorphous MDPs. We then derive a lower-bound performance guarantee for policies regularized by the stationary state distribution. In practice, SRPO can serve as an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO makes several context-based algorithms far more data efficient and significantly improves their overall performance.
    Comment: Preprint. Under Review
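
    One generic way to realize the state-distribution regularizer described above is to penalize visited states that a learned stationary-state density model scores as unlikely. The sketch below illustrates this idea with the density model treated as a black box; `log_density_fn` and the weight `alpha` are hypothetical, and this is not the paper's exact objective.

```python
def srpo_style_loss(policy_loss, states, log_density_fn, alpha=0.1):
    """Augment an arbitrary policy loss with a stationary-state-distribution penalty.

    policy_loss:    scalar loss from the base RL algorithm (e.g. SAC or PPO)
    states:         batch of states visited by the current policy
    log_density_fn: callable returning log-probabilities of states under the
                    stationary state distribution estimated from data gathered
                    in other (structurally similar) dynamics
    alpha:          trade-off between the RL objective and the regularizer
    """
    # Reward the policy for visiting states that were common under near-optimal
    # behavior in environments with similar structure but different dynamics.
    return policy_loss - alpha * log_density_fn(states).mean()
```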

    AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement

    Full text link
    Growing attention has been paid to Reinforcement Learning (RL) algorithms for optimizing long-term user engagement in sequential recommendation tasks. One challenge in large-scale online recommendation systems is the constant and complicated change in users' behavior patterns, such as interaction rates and retention tendencies. When the problem is formulated as a Markov Decision Process (MDP), the dynamics and reward functions of the recommendation system are continuously affected by these changes. Existing RL algorithms for recommendation systems suffer from distribution shift and struggle to adapt in such an MDP. In this paper, we introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue. AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories. This information reflects how well the RL policy fits current user behavior patterns and helps the policy identify subtle changes in the recommendation system. To adapt rapidly to these changes, AdaRec encourages exploration with the idea of optimism under uncertainty. The exploration is further guarded by zero-order action optimization to ensure stable recommendation quality in complicated environments. We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks, where AdaRec exhibits superior long-term performance compared to all baseline algorithms.
    Comment: Preprint. Under Review
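
    The zero-order action optimization mentioned above can be pictured as sampling small perturbations around the policy's proposed action and keeping the candidate a critic scores highest, so exploration never strays far from a known-good recommendation. The sketch below shows that generic idea; the function and its parameters are illustrative, not AdaRec's actual procedure.

```python
import torch

def zero_order_action_search(critic, obs, base_action, radius=0.05, num_samples=16):
    """Pick the best action within a small ball around the policy's proposal.

    critic(obs, action) -> value estimates of shape [batch]
    obs:         [batch, obs_dim], base_action: [batch, act_dim]
    radius:      maximum perturbation, keeping recommendations close to the
                 current policy so quality stays stable while exploring
    """
    batch, act_dim = base_action.shape
    # Candidate set: the base action plus random perturbations inside the ball.
    noise = torch.empty(num_samples, batch, act_dim).uniform_(-radius, radius)
    candidates = torch.cat(
        [base_action.unsqueeze(0), base_action.unsqueeze(0) + noise], dim=0
    )                                                  # [num_samples + 1, batch, act_dim]

    obs_rep = obs.unsqueeze(0).expand(num_samples + 1, *obs.shape)
    values = critic(obs_rep, candidates)               # [num_samples + 1, batch]
    best = values.argmax(dim=0)                        # best candidate index per batch item
    return candidates[best, torch.arange(batch)]       # [batch, act_dim]
```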

    Gene Expression Profiles Deciphering Rice Phenotypic Variation between Nipponbare (Japonica) and 93-11 (Indica) during Oxidative Stress

    Get PDF
    Rice is a very important food staple that feeds more than half of the world's population. The two major Asian cultivated rice (Oryza sativa L.) subspecies, japonica and indica, show significant phenotypic variation in their stress responses, but the molecular mechanisms underlying this variation are still largely unknown. A common link among different stresses is that they produce an oxidative burst and result in an increase of reactive oxygen species (ROS). In this study, methyl viologen (MV) was applied as a ROS agent to investigate the rice oxidative stress response. We observed that 93-11 (indica) seedlings exhibited leaf senescence with severe lesions under MV treatment compared to Nipponbare (japonica). Whole-genome microarray experiments were conducted, and 1,062 probe sets were identified that showed both gene expression level polymorphisms between the two rice cultivars and differential expression under MV treatment; these were designated Core Intersectional Probesets (CIPs). The CIPs were analyzed by gene ontology (GO) and showed enrichment for GO terms related to toxin and oxidative stress responses as well as other responses. These GO term-enriched genes of the CIPs include glutathione S-transferases (GSTs), P450s, plant defense genes, and secondary metabolism related genes such as chalcone synthase (CHS). Further insertion/deletion (InDel) and regulatory element analyses for the identified CIPs suggested that there may be eQTL hotspots related to oxidative stress in the rice genome, such as the GST genes encoded on chromosome 10. In addition, we identified a group of marker genes that distinguish the japonica and indica subspecies. In summary, we developed a new strategy combining biological experiments and data mining to study the possible molecular mechanisms of phenotypic variation during oxidative stress between Nipponbare and 93-11. This study will aid in the analysis of the molecular basis of quantitative traits.
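
    The data-mining strategy described above amounts to intersecting two probe-set lists (those differentially expressed under MV treatment and those showing expression-level polymorphism between the cultivars) and testing the intersection for GO-term enrichment. A minimal sketch of that generic recipe is given below; the column names, thresholds, and hypergeometric test are illustrative assumptions, not the study's actual pipeline.

```python
import pandas as pd
from scipy.stats import hypergeom

def core_intersectional_probesets(mv_de, cultivar_elp):
    """Intersect MV-responsive probe sets with cultivar expression-level polymorphisms.

    mv_de, cultivar_elp: DataFrames indexed by probe-set ID with an 'adj_p' column
    (illustrative layout; real microarray pipelines differ).
    """
    de_set = set(mv_de.index[mv_de["adj_p"] < 0.05])
    elp_set = set(cultivar_elp.index[cultivar_elp["adj_p"] < 0.05])
    return de_set & elp_set  # the "CIP" analogue: both responsive and polymorphic

def go_enrichment_p(cips, go_term_genes, background):
    """Hypergeometric p-value for over-representation of one GO term among the CIPs."""
    k = len(cips & go_term_genes)          # CIPs annotated with the term
    M = len(background)                    # all probe sets on the array
    n = len(go_term_genes & background)    # background probe sets with the term
    N = len(cips)                          # number of CIPs drawn
    return hypergeom.sf(k - 1, M, n, N)    # P(X >= k)
```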

    Reliability and validity of the international dementia alliance schedule for the assessment and staging of care in China

    No full text
    Background Clinical and social services are both important for dementia care. The International Dementia Alliance (IDEAL) Schedule for the Assessment and Staging of Care was developed to guide clinical and social care for dementia. Our study aimed to assess the validity and reliability of the IDEAL schedule in China. Methods Two hundred eighty-two dementia patients and their caregivers were recruited from 15 hospitals in China. Each patient-caregiver dyad was assessed with the IDEAL schedule by a rater and an observer simultaneously. The Clinical Dementia Rating (CDR), Mini-Mental Status Examination (MMSE), and Caregiver Burden Inventory (CBI) were administered to assess criterion validity. A repeat IDEAL assessment was conducted 7-10 days after the initial interview for 62 dyads. Results Two hundred seventy-seven patient-caregiver dyads completed the IDEAL assessment. Inter-rater reliability for the total score of the IDEAL schedule was 0.93 (95% CI = 0.92-0.95). The inter-class coefficient for the total score of IDEAL was 0.95 for the interviewers and 0.93 for the silent raters. The IDEAL total score correlated with the global CDR score (ρ = 0.72, p < 0.001), the CDR sum of boxes (CDR-SOB; ρ = 0.74, p < 0.001), the MMSE total score (ρ = −0.65, p < 0.001), and the CBI (ρ = 0.70, p < 0.001). All item scores of the IDEAL schedule were associated with the CDR-SOB (ρ = 0.17-0.79, all p < 0.05). Conclusion The IDEAL schedule is a valid and reliable tool for the staging of care for dementia in the Chinese population.
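
    To reproduce the kinds of reliability and validity statistics reported above on one's own data, one could compute Spearman correlations with SciPy and an intraclass correlation with the pingouin package, as in the sketch below. The data frames and their values are hypothetical placeholders, not the study's dataset.

```python
import pandas as pd
import pingouin as pg
from scipy.stats import spearmanr

# Hypothetical long-format ratings table: one row per (dyad, rater) pair.
ratings = pd.DataFrame({
    "dyad":  [1, 1, 2, 2, 3, 3],
    "rater": ["interviewer", "observer"] * 3,
    "ideal_total": [12, 13, 20, 19, 8, 9],
})

# Inter-rater reliability of the IDEAL total score (intraclass correlation).
icc = pg.intraclass_corr(data=ratings, targets="dyad", raters="rater", ratings="ideal_total")
print(icc[["Type", "ICC", "CI95%"]])

# Criterion validity: Spearman correlation between IDEAL totals and CDR-SOB,
# using hypothetical per-dyad scores.
scores = pd.DataFrame({
    "ideal_total": [12, 20, 8, 15, 18],
    "cdr_sob":     [3.0, 9.5, 1.5, 6.0, 8.0],
})
rho, p = spearmanr(scores["ideal_total"], scores["cdr_sob"])
print(f"rho = {rho:.2f}, p = {p:.3g}")
```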